Multiple property tolerance analysis for the evaluation of missense mutations.

Computational prediction of the impact of a mutation on protein function is still not accurate enough for clinical diagnostics without additional human expert analysis. Sequence alignment-based methods have been used extensively, but their results depend heavily on the quality of the input alignments and the choice of sequences. Incorporating structural information with alignments improves prediction accuracy. Here, we present a conservation of amino acid properties method for mutation prediction, Multiple Properties Tolerance Analysis (MuTA), and a new strategy, MuTA/S, to incorporate the solvent accessible surface (SAS) property into MuTA. Instead of combining multiple features by machine learning or mathematical methods, an intuitive strategy is used to divide the residues of a protein into different groups, and in each group the properties used are adjusted. The results for LacI, lysozyme, and HIV protease show that MuTA performs as well as the widely used SIFT algorithm, while MuTA/S outperforms SIFT and MuTA by 2%-25% in terms of prediction accuracy. By incorporating the SAS term alone, the alignment dependency of overall prediction accuracy is significantly reduced. MuTA/S also defines a new way to incorporate any structural features and knowledge and may lead to more accurate predictions.


Introduction
Computational prediction tools are needed to discover and prioritize candidate human disease alleles from uncharacterized human single nucleotide polymorphisms (SNPs). SNPs are now well known to play a critical but as yet largely uncharacterized role in human disease. However, experimental techniques able to identify deleterious mutations in proteins caused by SNPs are time-consuming and expensive. Although quantitative assessment algorithms do not replace clinically trained experts in diagnostic decisions, they are valuable tools in assisting with a diagnosis (Tchernitchko et al. 2004).
Two categories of algorithms (Saunders and Baker, 2002; Tchernitchko et al. 2004) have been developed recently to predict the mutation effect on protein function: phylogenetic (sequence alignment-based) and structural methods. Phylogenetic methods assume that functionally critical residues are conserved during the evolutionary process and use the phylogenetic information or the degree of conservation for each residue from the alignment of orthologs to predict the mutation effect (Cai et al. 2004; Krishnan and Westhead, 2003; Lau and Chasman, 2004; Mooney and Klein, 2002; Ng and Henikoff, 2001; Tavtigian et al. 2005). The SIFT method and server (Ng and Henikoff, 2001; Ng and Henikoff, 2002; Ng and Henikoff, 2003) is widely used for mutation effect prediction (Tchernitchko et al. 2004).
However, the 20 natural amino acids are intrinsically multi-dimensional in terms of physicochemical properties. For example, lysine (K) and leucine (L) have very similar size (volume) but very different charges and hydrophobicities. Consider the case of a mutation from the wild-type leucine to lysine at a position where phenylalanine (F) and glutamine (Q) have been observed in orthologs. Phenylalanine, leucine, glutamine, and lysine are all similar in size, although very different in other properties. To simultaneously take multiple physicochemical properties into account, Tavtigian et al. used three physicochemical properties to define the "physicochemical" distance of residue types at a given alignment position and predicted the mutation effect based on this definition of distance (Tavtigian et al. 2005). A similar algorithm, MAPP, was developed by Stone and Sidow (Stone and Sidow, 2005), in which six physicochemical properties were transformed to orthonormal properties and the physicochemical distance was calculated as a measure to classify mutation effect.
On the other hand, structural approaches attempt to capture the structural or environmental impact of mutation on the target protein residue (Herrgard et al. 2003; Sunyaev et al. 2001; Wang and Moult, 2001; Wang et al. 2003). Attempts to combine both categories of methods are making progress (Bao and Cui, 2005; Ramensky et al. 2002; Saunders and Baker, 2002) by incorporating structural information to complement the alignment-based approaches. Saunders and Baker utilized both classification tree and logistic regression classifier methods to combine multiple predictors, including the SIFT score and other structural features. Ramensky et al., in their PolyPhen server (http://www.bork.embl-heidelberg.de/PolyPhen/), used a set of empirical structure-based rules to predict the mutation effect. Bao and Cui derived several environmental parameters, along with the SIFT score, as the input factors for their support vector machine (SVM) and random forest (RF) methods.
A different approach, PMut by Ferrer-Costa et al. (Ferrer-Costa et al. 2002; Ferrer-Costa et al. 2004), trains a neural network (NN) on a large set of known data to predict the mutation effect in human genes and demonstrates the best prediction accuracy reported so far when the 3D structure information is used. PMut is very powerful for predicting the mutation effect for human genes. However, PMut uses existing mutation data as the basis for prediction and, when only the algorithm is considered, should not be directly compared to other "ab initio" methods, which use only input sequence alignments and known 3D structures for query sequences.
We here propose a novel strategy for mutation prediction. First we present a sequence alignment-based method, Multiple Properties Tolerance Analysis (MuTA), which is similar to the method by Tavtigian et al. (Tavtigian et al. 2005) and the MAPP algorithm by Stone and Sidow (Stone and Sidow, 2005) in that a set of physicochemical properties is used to measure the degree of conservation for a certain alignment position. The difference is that MuTA calculates the importance of each property independently and selects the most conserved properties as the predictors, while Tavtigian's method and the MAPP algorithm use all physicochemical properties to calculate the distances.
Secondly, we propose a novel approach, MuTA/S, to incorporate the structural features and the sequence alignment containing evolutionary information. MuTA/S assumes that a residue's functionality, and therefore its prediction criteria, should depend on its local environment. MuTA/S defines various regions according to their environments and treats each region differently. The concept of region has been seen in methods like PolyPhen (Ramensky et al. 2002; Sunyaev et al. 2003; Sunyaev et al. 2001). In PolyPhen, the concept is used as one of the criteria for the decision tree, while in MuTA/S it is used to divide a protein into different regions. For each region, the same algorithm but region-optimized parameters are used to perform the prediction.
MuTA already considers the local environment effect by selecting different physicochemical properties according to their evolutionary conservation at each individual alignment position. However, the evolutionary information from an alignment is usually not sufficient, and it is almost impossible to obtain alignments consisting of "perfect" ortholog sequences. As mentioned by Saunders and Baker, structural features, especially SAS, are found useful to increase the prediction accuracy. Unlike the methods of Saunders and Baker (Saunders and Baker, 2002) or Bao and Cui (Bao and Cui, 2005), in which mathematical treatments or machine learning methods are used to combine various structural features and the SIFT score, MuTA/S groups the residues of the target protein into several regions according to one or a few structural features (in this paper, only solvent accessible area, SAS, is used) and treats each region differently.

MuTA Algorithm
The MuTA algorithm first selects the most important physical properties according to the conservation of the properties, and then, for the selected important properties, calculates the deviation of the mutation for a given alignment position. A mutation is classified as benign if the deviation is smaller than an empirically determined cutoff, and as deleterious otherwise.
To quantitatively define the degree of deviation for a certain property at an alignment position, we first calculate the mean and the standard deviation of a property from the distribution containing all existing types of amino acids at this position, which we denote as $\bar{x}_i^k$ and $\sigma(x_i)^k$:

$$\bar{x}_i^k = \frac{1}{N}\sum_{n=1}^{N} x_{i,n}^k, \qquad \sigma(x_i)^k = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(x_{i,n}^k - \bar{x}_i^k\right)^2} \tag{1}$$

where $x_i$ denotes the ith physicochemical property, k is the alignment position, n is the index for different types of amino acids occurring at this position, and N is the total number of types of amino acids at this position. For example, if there are three residue types at the 100th position of an alignment, A, D, and E, and $x_i$ is the net side-chain charge, $\bar{x}_i^k$ can be calculated as (0 + (-1) + (-1))/3 = -2/3 and $\sigma(x_i)^k$ = 0.47140. MuTA assumes that different physicochemical properties will have different degrees of conservation at different positions. The relative importance of the ith property $x_i$ at the kth alignment position, $I_i^k$, is defined as

$$I_i^k = \frac{\sigma(x_i)^k}{\sigma(x_i)^{NAT}} \tag{2}$$

where $\sigma(x_i)^{NAT}$ is the standard deviation calculated from the distribution formed by all 20 natural amino acids, while $\sigma(x_i)^k$ is the standard deviation calculated from the distribution formed from the kth alignment position. A smaller $I_i^k$ means the corresponding property is more conserved than the natural distribution of this property and thus has greater importance.
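As a minimal sketch of the calculation described above, the following Python reproduces the A/D/E charge example. It assumes the population form of the standard deviation (division by N), which is what reproduces the quoted value 0.47140, and it assumes that the relative importance is the ratio of the position's standard deviation to that of the 20 natural amino acids, which is consistent with the statement that a smaller value means a more conserved property. The function names and the simplified charge table (D, E = -1; K, R = +1; all others 0) are illustrative assumptions, not the authors' code or property set.

```python
import math

def mean_and_std(values):
    """Population mean and standard deviation (divide by N, not N - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

def relative_importance(position_values, natural_values):
    """Assumed form: std at the alignment position / std over all 20 amino acids."""
    _, sigma_k = mean_and_std(position_values)
    _, sigma_nat = mean_and_std(natural_values)
    return sigma_k / sigma_nat

# Illustrative net side-chain charges (an assumption for this sketch only).
CHARGE = {aa: 0.0 for aa in "ACDEFGHIKLMNPQRSTVWY"}
CHARGE.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})

# Residue types observed at the example position: A, D, and E.
observed = [CHARGE[aa] for aa in "ADE"]
mean_k, sigma_k = mean_and_std(observed)
importance = relative_importance(observed, list(CHARGE.values()))

print(round(mean_k, 5), round(sigma_k, 5))  # -0.66667 0.4714
```

Each residue type is counted once regardless of how many sequences carry it, matching the definition of N as the number of amino acid types at the position.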
Given a mutant of a certain amino acid type µ at the kth position, the relative deviation, $\sigma_\mu^k$, from the above distribution is defined as:

$$\sigma_\mu^k = \frac{\left|x_{i,\mu} - \bar{x}_i^k\right|}{\sigma(x_i)^k + \Delta} \tag{3}$$

where $x_{i,\mu}$ is the value of the ith property for the mutant residue and ∆ is a small real number. The purpose of ∆ is twofold: first, to avoid a division-by-zero error in the case of a totally conserved position, where $\sigma(x_i)^k$ is zero; second, to allow some degree of tolerance for a conserved position, e.g. for a conserved position with wild-type residue D, we may change the value of ∆ so that a mutation to E is considered benign. If ∆ = 0.058 is used, a mutation to K (charge = +1) in the above example will give $\sigma_\mu^k$ = |+1 - (-2/3)|/(0.47140 + 0.058) = 3.1482.
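The worked example above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text; the function name and argument names are hypothetical.

```python
def relative_deviation(mutant_value, position_mean, position_std, delta):
    """|mutant value - position mean| / (position std + delta)."""
    return abs(mutant_value - position_mean) / (position_std + delta)

# Position with A, D, E (mean charge -2/3, std 0.47140); mutation to K (charge +1).
deviation = relative_deviation(1.0, -2.0 / 3.0, 0.47140, 0.058)
print(round(deviation, 4))  # 3.1482

# Totally conserved position (only D, std = 0): delta keeps the ratio finite.
conserved_to_k = relative_deviation(1.0, -1.0, 0.0, 0.058)
conserved_to_e = relative_deviation(-1.0, -1.0, 0.0, 0.058)
```

For the conserved position, a mutation to E leaves the charge unchanged and gives a deviation of zero for this property, while a mutation to K gives a large but finite value.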
It is clear that the relative deviation $\sigma_\mu^k$ is a measure of the deviation of the mutation from the distribution formed by the existing amino acid types at the alignment position. Hence we use it as the criterion to define the mutation as benign or deleterious. A constant threshold, τ, is used in the decision criterion, Eq. (4), where M is the total number of properties. The square-root dependency is determined empirically. The determination of the empirical parameters, ∆ and τ, is described in the Implementation section.
Thus we have defined a way to measure the deviation of a physicochemical property for a certain mutation from the distribution formed by the alignment.
The MuTA procedure can be summarized as follows: First an alignment containing the query sequence is obtained and a set of physicochemical properties is chosen. For each property at each position, $\sigma(x_i)^k$, $\sigma(x_i)^{NAT}$, $I_i^k$, and $\sigma_\mu^k$ are calculated from the above equations. The most important properties are selected according to $I_i^k$. If any property in this set of "most important properties" indicates a deleterious mutation according to Eq. (4), then the mutation is considered deleterious.
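The selection and decision steps summarized above can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the cutoff is passed in as a single number (how it is derived from τ and M in Eq. (4) is outside this sketch), and the property names and numeric values are hypothetical.

```python
def classify_mutation(importances, deviations, n_selected, cutoff):
    """Select the n_selected most conserved properties (smallest importance
    value) and call the mutation deleterious if any of them deviates by more
    than the cutoff. Returns (label, selected property names)."""
    ranked = sorted(importances, key=importances.get)
    selected = ranked[:n_selected]
    if any(deviations[name] > cutoff for name in selected):
        return "deleterious", selected
    return "benign", selected

# Hypothetical per-property values at one alignment position.
importances = {"charge": 0.2, "volume": 0.9, "hydrophobicity": 0.5, "polarity": 1.1}
deviations = {"charge": 3.15, "volume": 0.4, "hydrophobicity": 1.2, "polarity": 5.0}

label, used = classify_mutation(importances, deviations, n_selected=2, cutoff=2.23)
print(label, used)  # deleterious ['charge', 'hydrophobicity']
```

Note that the large deviation in the poorly conserved "polarity" property is ignored: only the properties ranked as most important at this position can trigger a deleterious call.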
In the above equations, we do not consider different weighting of each sequence when alignments are relatively small (no more than 30 sequences). For alignments with a large number of sequences (> 30 sequences), each sequence is given equal weighting so that a single sequence or alignment error will not pollute the results significantly.

MuTA/S Algorithm
In the MuTA/S algorithm, the "region" concept is added to the original MuTA algorithm as follows. For each alignment position, the user can define the region it belongs to. Each region is treated like a separate MuTA system. Different sets of properties can be used and all MuTA parameters are optimized within an individual region. For example, one can define a region in which only charge and side chain size are important, so that only these properties are used for MuTA prediction and the cutoff constant is adjusted accordingly.
Regions can be manually defined when sufficient knowledge of the local environment is available. They can also be defined automatically by one or more structural properties if these can be calculated from the 3-D structure. In this paper we use relative SAS to classify residues of a protein into four regions:
Region 1: Relative SAS ≥ 50%
Region 2: 50% > Relative SAS ≥ 30%
Region 3: 30% > Relative SAS ≥ 20%
Region 4: 20% > Relative SAS ≥ 0%
The specific sets of parameters for those regions are described in the System and Method section.
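The four-region classification by relative SAS can be expressed directly in code. The sketch below groups alignment positions by region; the function names and the example SAS values are hypothetical, and relative SAS is assumed to be given in percent.

```python
from collections import defaultdict

def sas_region(relative_sas):
    """Return the region number (1-4) for a relative SAS given in percent."""
    if relative_sas < 0:
        raise ValueError("relative SAS cannot be negative")
    if relative_sas >= 50:
        return 1
    if relative_sas >= 30:
        return 2
    if relative_sas >= 20:
        return 3
    return 4

def group_by_region(relative_sas_by_position):
    """Group alignment positions by region; input maps position -> relative SAS (%)."""
    groups = defaultdict(list)
    for position, sas in sorted(relative_sas_by_position.items()):
        groups[sas_region(sas)].append(position)
    return dict(groups)

# Hypothetical relative SAS values for five positions.
example = {10: 72.5, 11: 50.0, 12: 31.0, 13: 20.0, 14: 4.2}
print(group_by_region(example))  # {1: [10, 11], 2: [12], 3: [13], 4: [14]}
```

The boundary values 50%, 30%, and 20% fall into the more exposed region, following the "≥" in the region definitions.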

Property Selection
The selection and exclusion of the appropriate properties are critical to the MuTA approach. It is important for MuTA that the properties are distinct or relatively orthogonal to each other, and that an appropriate number of properties is included. AAindex v7.0 (Kawashima et al. 1999) lists 516 one-dimensional properties and 83 substitution matrices for the 20 natural amino acids. Tomii and Kanehisa (Tomii and Kanehisa, 1996) have analyzed the similarities in 402 amino acid properties using a single-linkage hierarchical cluster analysis, and visualized them using a minimum spanning tree. They demonstrated that amino acid properties can be roughly divided into 6 groups: hydrophobicity, composition, physicochemical properties, beta sheet propensity, alpha helix and turn propensity, and other.
To select properties that were well understood, in frequent use, as distinct from each other as possible, and adequately representative of the available properties, we utilized a clustering and visualization technique known as self-organizing maps (SOM). SOM reduces the dimensionality of data through self-organizing neural networks (Kohonen, 1988). Each of the 516 one-dimensional amino acid properties was scaled and centered, and then clustered and visualized with the SOM toolbox for Matlab (Vesant, 1999). Each amino acid has its own unique SOM maps for all properties. Properties corresponding to the most dissimilar points in the SOM maps among all amino acids are chosen. The properties chosen are listed in Table 1. Currently, the purpose of the described SOM technique is only to select distinct properties via an easy and visual procedure; it should not be treated as a theoretically robust method that may improve the predictive power. We also have not tested the prediction power of different sets of parameters.
Besides the properties used, three empirical parameters were also defined. The first parameter, τ, is the threshold that determines when a mutation score is considered damaging. The second parameter, ∆, is a small real number used to avoid a potential division-by-zero error. The third parameter, f, is the ratio specifying the number of important properties to be considered. For example, if there are 12 properties and f = 4, the first 3 (since 12/4 = 3) most important properties are considered for each position.
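The role of f can be illustrated with a one-line calculation. The text only gives the case where f divides the number of properties evenly; rounding down and keeping at least one property in other cases are assumptions made for this sketch, and the function name is hypothetical.

```python
def n_important_properties(n_properties, f):
    """Number of most-important properties considered per position.
    Integer division and the minimum of 1 are assumptions of this sketch."""
    return max(1, n_properties // f)

print(n_important_properties(12, 4))  # 3
```

With f = 1 every property is considered, which corresponds to the "all properties" setting.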
In this paper the following parameters are used for MuTA, which are empirically tuned for LacI: τ = 2.23, ∆ = 0.058, and f = 4. For MuTA/S, each region has its own optimized parameter set.
Physicochemical properties may be highly correlated, and their values may be in totally different and unrelated units. Two steps are taken to correct for these two problems. First, all properties are normalized: the mean value is subtracted from each original property value and the result is divided by the standard deviation, where the mean and the standard deviation are calculated over the distribution of all 20 amino acids. Principal Component Analysis (PCA) is then used to transform the normalized properties into mutually orthonormal properties. This set of orthonormal properties was used subsequently in this paper, although users can turn the normalization or PCA step off through MuTA's XML input file. Table 2 shows the contribution from all 10 properties to all principal components.
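The normalization step can be sketched in plain Python as a z-score over the 20 amino acids. The subsequent PCA step is not shown. The use of the population standard deviation and the simplified charge table (D, E = -1; K, R = +1; all others 0) are assumptions made for illustration and are not taken from Table 1.

```python
import math

def standardize(values_by_residue):
    """Subtract the mean and divide by the (population) standard deviation,
    both taken over all amino acids in the input mapping."""
    values = list(values_by_residue.values())
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {aa: (v - mean) / std for aa, v in values_by_residue.items()}

# Illustrative property: net side-chain charge (an assumption for this sketch).
charge = {aa: 0.0 for aa in "ACDEFGHIKLMNPQRSTVWY"}
charge.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})

z = standardize(charge)
print(round(z["K"], 3))  # 2.236
```

After this step every property has zero mean and unit variance across the 20 amino acids, so properties measured in unrelated units become comparable.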

Region Definition
SAS was chosen as the region classifier for the MuTA/S algorithm. While there are important exceptions, it has been widely accepted that solvent-exposed residues are likely to tolerate random mutation without affecting the protein function or the binding between the protein and substrates or ligands. Hence we define the first region, Region 1, as consisting of all solvent-exposed residues (relative SAS ≥ 50%). All mutations in this region are considered benign. Table 3 shows the correlation between relative SAS and the standard deviations for ten properties for LacI. A higher correlation coefficient means the standard deviation for the property is larger when relative SAS is larger. A larger standard deviation for a property means that the property is less conserved or less important, as described in the previous section and Eq. (2). Based on this argument, Region 2, consisting of the second most exposed residues (50% > Relative SAS ≥ 30%), uses only a subset of properties. This subset of properties is formed by removing the four properties with the highest correlation coefficients in Table 3. The removed properties are Hopp-Woods hydrophobicity, average accessible area in proteins, side chain charge, and solvation energy. A cutoff with a larger value, τ = 15, is used, which reflects the fact that the more exposed residues should have a higher cutoff to be considered deleterious. Region 3, which consists of residues with relative SAS between 20% and 30%, uses a lower cutoff, τ = 9.5. Region 4, consisting of buried residues, uses the lowest cutoff, τ = 2.13, and all properties are selected (f = 1). The parameters for each region are manually tuned against LacI and Lysozyme with manually curated alignments. The original MuTA parameters, empirically tuned for LacI, are τ = 2.23, ∆ = 0.058, and f = 4. The definition of each region and its parameters are listed as follows:
Region 1: Relative SAS ≥ 50%: All mutations are treated as benign.
Region 2: 50% > Relative SAS ≥ 30%: τ = 15, ∆ = 0.058, and f = 2; a subset of properties.
Region 3: 30% > Relative SAS ≥ 20%: τ = 9.5, ∆ = 0.058, and f = 1; all properties.
Region 4: 20% > Relative SAS ≥ 0%: τ = 2.13, ∆ = 0.058, and f = 1; all properties.
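The region definitions and parameters listed above lend themselves to a small lookup table. The sketch below encodes the values given in the text; the data layout, key names, and function name are hypothetical, and relative SAS is assumed to be in percent.

```python
# (lower bound of relative SAS in percent, region number, parameters)
REGION_TABLE = [
    (50.0, 1, None),  # Region 1: every mutation is treated as benign
    (30.0, 2, {"tau": 15.0, "delta": 0.058, "f": 2, "properties": "subset"}),
    (20.0, 3, {"tau": 9.5, "delta": 0.058, "f": 1, "properties": "all"}),
    (0.0, 4, {"tau": 2.13, "delta": 0.058, "f": 1, "properties": "all"}),
]

def region_parameters(relative_sas):
    """Return (region number, parameter dict or None) for a relative SAS value."""
    for lower_bound, region, params in REGION_TABLE:
        if relative_sas >= lower_bound:
            return region, params
    raise ValueError("relative SAS must be non-negative")

region, params = region_parameters(25.0)
print(region, params["tau"])  # 3 9.5
```

Keeping the per-region settings in data rather than in branching logic makes it straightforward to add regions based on other structural features.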

Protein Systems
Three protein systems were used in this study, LacI (Markiewicz et al. 1994), T4-Lysozyme (Rennell et al. 1991), and HIV-1 protease (Loeb et al. 1989 (Fraczkiewicz and Braun, 1998). The default atomic van der Waals parameters and a standard 1.4 Å solvent probe radius were used.
All SIFT results reported here were calculated by the SIFT program downloaded from the SIFT server (http://blocks.fhcrc.org/sift/SIFT.html). All experimental results are also taken from the SIFT server. The MAPP program was downloaded from http://mendel.stanford.edu/SidowLab/downloads/MAPP/MAPP.html.
The MuTA (including MuTA/S) program was implemented in ANSI C++ with the Standard Template Library (STL) and the Template Numerical Toolkit (TNT) for linear algebra (Barrett et al. 1994). The input parameters and output results are in XML format and are extensible and portable. The program runs on Windows XP SP2 and Linux RedHat 9.0. The compilers used are Microsoft Visual C++ 6.0 SP6 (Win32) and Intel C++ 9.0 (Linux). The test cases were performed on a Dell D600 notebook computer with a 1.7 GHz Pentium-M CPU and 2 GB RAM. A web portal running the MuTA and MuTA/S programs is available for Quest Diagnostics' internal usage and will soon be available for public access.

Results and Discussion
LacI and T4-Lysozyme were chosen for tuning and benchmarking for both MuTA and MuTA/S. The experimental data are taken from the SIFT server (http://blocks.fhcrc.org/sift/SIFT.html). Since the NR and SwissProt databases are continually updated and alignments from those databases differ over time, our SIFT results are slightly different from the originally reported SIFT results. The term "benign" used in this paper has the same meaning as the term "tolerant" in the SIFT paper by Ng and Henikoff (Ng and Henikoff, 2001). However, the term "deleterious" here refers to mutations with a strong deleterious effect, while the same term in the SIFT paper means all non-tolerant mutations. The definition used here has been used recently (Stone and Sidow, 2005). These differences lead to a different number of data points and different results for SIFT in our results. MuTA and MuTA/S were manually tuned for these alignments; hence the results can be considered a "training data set". Table 4 shows the results from SIFT, MAPP, MuTA, and MuTA/S with different regions from human expert curated alignments. Similar performance is seen for MuTA, SIFT and MAPP. Nevertheless, MuTA/S gives better results when more regions are used. When all four regions are used, the overall prediction percentage of MuTA/S is significantly better than that of SIFT and MuTA. SIFT, in multiple studies, has given good results to date for sequence-based prediction (Saunders and Baker, 2002; Tchernitchko et al. 2004). To our knowledge, the prediction percentages for LacI and Lysozyme by MuTA/S are the best results reported in the literature so far if empirical learning methods, such as PMut (Ferrer-Costa et al. 2002; Ferrer-Costa et al. 2004), are not considered. Further examining the results for Lysozyme, we found that most of the false positive results (experiment = benign; prediction = deleterious) are from highly buried and totally conserved residues. It is almost impossible for an alignment-based prediction method to correctly predict such mutations, unless detailed atomic-level information and/or interactions of this type of residue can be used in the prediction.

Table notes: 1 For the "Benign" entries, the first number is the number of benign predictions and the second number is the total number of experimentally confirmed benign mutations. "Deleterious" entries have a similar meaning but are for deleterious mutations. 2 The entries of "Prediction Accuracy" are the prediction percentages calculated by averaging the benign and deleterious percentages.
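Because the "Prediction Accuracy" entries are calculated by averaging the benign and deleterious percentages, they weight both classes equally regardless of how many mutations each class contains. A minimal sketch of that calculation follows; the function name and the counts are hypothetical and are not taken from the tables.

```python
def prediction_accuracy(benign_correct, benign_total,
                        deleterious_correct, deleterious_total):
    """Return (overall, benign %, deleterious %), where overall is the plain
    average of the two class percentages."""
    benign_pct = 100.0 * benign_correct / benign_total
    deleterious_pct = 100.0 * deleterious_correct / deleterious_total
    return (benign_pct + deleterious_pct) / 2.0, benign_pct, deleterious_pct

# Hypothetical counts: 80 of 100 benign and 30 of 50 deleterious mutations correct.
overall, benign_pct, deleterious_pct = prediction_accuracy(80, 100, 30, 50)
print(overall, benign_pct, deleterious_pct)  # 70.0 80.0 60.0
```

For these counts the pooled fraction of correct predictions would be 110/150, about 73.3%, so the averaged figure is not the same as the raw fraction of correct calls when the classes differ in size.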
In addition to the above training data set, we performed MuTA/S analysis on LacI, Lysozyme, and HIV-1 protease (Loeb et al. 1989) with different alignments. The alignments, taken from the SwissProt database (Bairoch and Apweiler, 2000) and NCBI's non-redundant database (NR, http://www.ncbi.nlm.nih.gov/) through the SIFT server (http://blocks.fhcrc.org/sift/SIFT.html), should be considered "test data sets" since MuTA/S is not optimized against them. Those three sets of data (LacI, Lysozyme, and HIV-1 protease) have been widely used as benchmark sets for mutation effect prediction. The results are listed in Table 5, Table 6, and Table 7, respectively. In addition to these three sets of data, results for two genes of high interest, Cystic fibrosis transmembrane conductance regulator, CFTR (Riordan et al. 1989), and Glucose-6-phosphate dehydrogenase, G6PD (Kwok et al. 2002), are listed in Table 8. For all data sets, MuTA/S consistently outperforms SIFT and MuTA by 2% to 25%. We were not able to perform high-throughput runs via the PolyPhen or PMut web interfaces; hence no direct comparison to MuTA/S can be made.
The main features of MuTA are that multidimensional predictors, which are physicochemical properties, are used rather than a single predictor, and that the selection of the predictors is based on the conservation of the predictors at the specific position. The use of multiple physicochemical properties as predictors combined with the position-specific selection of the predictors has at least three advantages: Firstly, the prediction is more reliable since different predictors are used for different environments. Secondly, for a specific position, the predictor(s) chosen reflect the importance of certain physicochemical properties at this position. This information can be further examined and rationalized with structural or other types of data. Thirdly, the properties chosen can be structural properties, e.g. at a given alignment position, solvent accessible areas for all sequences if their structures are known or can be modeled. Thus structure-specific properties can be used, while only amino acid type-specific data can be used in SIFT or other similar approaches. All major sequence alignment-based prediction methods, including SIFT, PMut, PolyPhen, and MAPP, ignore the inter-residue interaction, at least explicitly. The prediction of a mutation effect on covariant residues (Clarke, 1995) is very difficult, if not impossible. In such cases, successful prediction may require molecular dynamics or free energy perturbation simulations at the atomic level to understand the detailed interactions between residues, or co-variance analysis using the sequence alignment to assess the dependency rules for residues.

Table 5. Prediction results from SIFT, MuTA, and MuTA/S using different alignments for LacI. Percentage entries are in the format of "overall percentage (benign percentage, deleterious percentage)". The overall percentage is the average of the benign percentage and the deleterious percentage.
Local conformational changes from sequence to sequence within the same alignment will invalidate conservation analysis-based mutation prediction. Again, other algorithms or simulation tools are needed in such cases.
Because sequence-based methods, such as SIFT and MuTA, depend heavily on accurate sequence alignments, an alignment consisting of widely spread ortholog sequences for the same function would be ideal. In practice, it is difficult to obtain such sets of sequences. Automated genome-wide methods would benefit substantially if a way to distinguish and extract information from ortholog and paralog sequences were defined and employed in the conservation analysis. Because the goal is for these algorithms to support human experts making diagnostic testing decisions, we believe that careful manual preparation of alignments is a vital component of providing a useful sequencing assay service. This is a different approach from that offered by web-based tools that operate across the entire human genome, in which alignments are generated automatically.
Saunders and Baker concluded that accurate mutation prediction requires sufficient evolutionary information, but structural information may increase the accuracy when there is a lack of evolutionary information (Saunders and Baker, 2002). Our results in Table 5, Table 6, and Table 7 clearly confirm their conclusions, although the number of sequences does not seem to be a determining factor. The average Shannon's Entropy (Shannon, 1984), I, for each alignment, which is a measure of the sequence divergence, is also listed in Table 5, Table 6, and Table 7. Our results show that alignments with enough sequence divergence are critical. SIFT and MuTA results using SwissProt alignments are consistently better than the results using NR alignments, which is expected since a sequence search against the SwissProt database would return more diverged sequences than the NR database, due to the fact that NR has many more highly similar sequences. When only NR sequences are used, SIFT and MuTA usually predict relatively poorly.
MuTA/S, on the other hand, not only outperforms SIFT or MuTA but is more stable against different alignments. For example, Table 5 shows that the maximum difference in prediction accuracy is around 23% for LacI when SIFT is used (50.26% vs. 72.93%), while it goes down to around 7% using MuTA/S (70.68% vs. 77.86%). For the three protein systems the worst prediction accuracy from MuTA/S is 69.81% while the lowest accuracy from SIFT and MuTA is as low as 50%. An accuracy of 50% means an incorrect prediction half of the time, or almost no distinguishing prediction power.
In commercial diagnostic testing, usually a limited number of gene tests are performed in very large volumes. The alignments must be correct, so hand curated alignments are both practical and critical to obtaining accurate prediction results. The alignment of sequences can easily be automated, with many different alignment algorithms generally producing very similar results. However, the choice of which orthologs and paralogs to include can currently be made with better results by a human expert than by a computer. For example, depending on the degree to which a particular protein function is conserved, it may be appropriate for one protein system to include conserved orthologs across all vertebrates, but in another system to include only mammals. The inclusion of paralogs (or paralogous domains) also frequently needs to be evaluated for each system. We selected sequences and performed curated alignments on only a small set of genes for clinical diagnostic purposes. Those curated alignments are intended to give the best possible alignment-based predictions. However, for an automated prediction server for any gene, greater stability of results against different alignments is critical. With genome-wide automated prediction, a user normally would supply an input sequence and probably request an automatic sequence search against public sequence databases, such as SwissProt or NR, to build the alignment. In such cases, the quality of the input alignments may not be ideal for sequence alignment-based mutation predictions. A stable method like MuTA/S can at least give users reasonable results, although, unlike SIFT, MuTA and MuTA/S currently contain only the prediction algorithm and do not provide automatic sequence search and alignment functions, which can easily be performed by various available tools such as PSI-BLAST. An alternative approach is seen in PMut (Ferrer-Costa et al. 2002; Ferrer-Costa et al. 2004), where a large set of data for a certain gene is preprocessed and the prediction is based on the neural network learning results; hence less alignment-dependent results would be expected.
For MuTA/S, we already mentioned that it is not always true that solvent-exposed residues will tolerate random mutation without affecting the protein function or the binding between the protein and substrates or ligands. The Region 1 definition used in this paper considers all mutations of exposed residues (relative SAS ≥ 50%) benign, which is clearly not correct in some cases, although we found that this rule correctly predicts benign mutations in more than 80% of cases. To further increase the prediction accuracy, this issue must be addressed. The exceptions to the Region 1 rule could be: 1. the residues could play a "gatekeeper" role, controlling or specifying substrate/ligand entrance or exit; 2. they could be important for protein-protein interactions; 3. the residues could in fact not be solvent-exposed in vivo; 4. other unknown reasons. To address those exceptions, detailed biochemical knowledge may be necessary.

Table 6. Prediction results from SIFT, MuTA, and MuTA/S using different alignments for Lysozyme. Percentage entries are in the format of "overall percentage (benign percentage, deleterious percentage)". The overall percentage is the average of the benign percentage and the deleterious percentage.

Table 7. Prediction results from SIFT, MuTA, and MuTA/S using different alignments for HIV-1 protease. Percentage entries are in the format of "overall percentage (benign percentage, deleterious percentage)". The overall percentage is the average of the benign percentage and the deleterious percentage.
Furthermore, extra caution should be taken when calculating SAS for a protein. For example, the SAS from the HIV-1 protease dimer with substrate/ligand should be used, not the SAS from an HIV-1 protease monomer. Also, our current implementation is not able to deal with multiple structures for one alignment. One possible solution is to assign every alignment position to its SAS region according to the maximum SAS in the structures, since a high SAS probably means the alignment position is more tolerant of mutation.
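The possible solution mentioned above, assigning each alignment position according to the maximum SAS across structures, could look like the following sketch. It is not part of the current implementation described in the text; the function name, the data layout, and the SAS values are hypothetical.

```python
def max_sas_per_position(sas_by_structure):
    """Combine several structures into one SAS value per alignment position.

    sas_by_structure: list of dicts mapping alignment position -> relative SAS (%).
    A position missing from a structure is simply skipped for that structure.
    """
    combined = {}
    for structure in sas_by_structure:
        for position, sas in structure.items():
            combined[position] = max(sas, combined.get(position, sas))
    return combined

# Hypothetical relative SAS values from two structures of the same alignment.
structure_a = {5: 12.0, 6: 48.0, 7: 61.0}
structure_b = {5: 25.0, 6: 33.0}
print(max_sas_per_position([structure_a, structure_b]))  # {5: 25.0, 6: 48.0, 7: 61.0}
```

The combined values can then be mapped to Regions 1 to 4 with the same thresholds used for a single structure.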

The concept of region in MuTA/S can be extended beyond SAS classification. For example, consider a ligand binding site region where the properties of size and charge are important. Only these two properties could be used for mutation effect prediction within this region. Hence we define a way to incorporate structural and/or mechanistic knowledge into prediction methods. This concept could be applied to protein systems where substantial structure-function knowledge is available and should lead to highly accurate predictions for those specific protein systems. It can also be applied to other sequence-based approaches, such as SIFT or MAPP: the empirical parameters can be optimized for different regions and improved results should be expected.
In summary, we present the MuTA algorithm and its extension, the MuTA/S algorithm. MuTA provides a framework for mutation prediction methods, while MuTA/S builds on this framework and incorporates SAS information into the prediction. Tests on LacI, Lysozyme, and HIV-1 protease show that MuTA/S significantly improves the prediction accuracy and reduces the alignment dependency. The approach of MuTA/S also makes it possible to incorporate other structural or mechanistic knowledge into mutation effect prediction.

Table 8. Prediction results from SIFT, MuTA, and MuTA/S for CFTR and G6PD. Entries are in the format of "overall percentage (benign percentage, deleterious percentage)". The overall percentage is the average of the benign percentage and the deleterious percentage. 1 Alignments are human-curated. 2 The structure for CFTR is taken from PDB ID:2F9Q; the structure for G6PD is taken from PDB ID:1QKI.